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1  Introduction 


Discrete  Markov  random  fields  (MRF’s)  defined  on  a  finite  lattice  have  seen  significant  applica¬ 
tion  as  stochastic  models  for  images  [1],  [2].  There  are  two  fundamental  problems  associated  with 
image  processing  based  on  such  random  field  models.  First,  we  want  to  generate  realizations  of  the 
random  fields  to  determine  their  suitability  as  models  of  our  prior  knowledge.  Second,  we  want  to 
collect  statistics  and  perform  optimizations  associated  with  the  random  fields  to  solve  model-based 
estimation  problems,  e.g.,  image  restoration  and  segmentation. 

According  to  the  Hammersley-Cliiford  Theorem  [3],  MRF’s  which  are  defined  on  a  lattice  are 
in  one-to-one  correspondence  with  Gibbs  distributions.  Starting  with  [4]  there  have  been  various 
constructions  of  Markov  chains  which  possess  a  Gibbs  invariant  distribution,  and  whose  common 
characteristic  is  that  their  transition  probabilities  depend  only  on  the  ratio  of  the  Gibbs  probabilities 
(and  not  on  the  normalization  constant).  These  chains  can  be  used  via  Monte  Carlo  simulation 
for  sampling  from  Gibbs  distributions  at  a  fixed  temperature,  and  for  finding  globally  minimum 
energy  states  by  slowly  decreasing  the  temperature  as  in  the  simulated  annealing  (or  stochastic 
relaxation)  method  [5],  [6].  Certain  types  of  diffusion  processes  which  also  have  a  Gibbs  invariant 
distribution  can  be  used  for  the  same  purposes  when  the  random  fields  are  continuous-valued  [7], 
[8]. 

Many  of  the  fundamental  ideas  on  MRF-based  image  processing  stem  from  [6],  which  introduced 
the  idea  of  modelling  an  image  with  a  compound  random  field  for  both  the  intensity  and  boundary 
processes.  This  prior  random  field  is  a  MRF  characterized  by  a  Gibbs  distribution.  A  measurement 
model  is  specified  for  the  observed  image,  and  the  resulting  posteriori  random  field  is  also  a  MRF 
characterized  by  a  Gibbs  distribution.  A  maximum  a  posteriori  probability  (MAP)  estimate  of  the 
image  based  on  the  noisy  observations  is  then  found  by  minimizing  the  posterior  Gibbs  energy  via 
simulated  annealing. 

There  have  been  numerous  variations  and  extensions  of  the  ideas  in  [6],  including  different  esti¬ 
mation  criteria,  different  methods  to  perform  the  annealing,  and  different  methods  to  determine  the 
random  field  parameters  [9]-[12].  We  note  that  some  of  the  alternative  estimators  that  have  been 
proposed  do  not  use  annealing  but  rather  collect  statistics  at  a  fixed  temperature,  e.g.,  the  max¬ 
imizer  of  the  posterior  marginals  (MPM)  and  the  thresholded  posterior  mean  (TPM)  estimators 
[9].  The  scope  of  the  MRF  image  models  has  also  been  enlarged  over  time.  Most  of  the  early  work 
on  Monte  Carlo  sampling  methods  and  annealing  algorithms  as  applied  to  MRF-based  image  pro¬ 
cessing  considered  finite-valued  MRF’s  (e.g.,  generalized  Ising  models)  to  model  discrete  grey  levels 
distributions  [6].  Some  more  recent  work  has  dealt  with  continuous- valued  MRF’s  (e.g.  Gauss- 
Markov  models)  to  model  continuous  grey  level  distributions  [13],  [14].  In  certain  applications  it 
may  be  advantageous  to  use  a  continuous  Gauss-Markov  random  field  model  for  computational 
and  modelling  considerations  even  when  the  image  pixels  can  actually  take  only  a  finite  (but  large) 
number  of  grey-level  values.  Both  Markov  chain  sampling  methods  and  annealing  algorithms,  and 
diffusion-type  sampling  methods  and  annealing  algorithms  have  been  used  in  continuous-valued 
MRF-based  image  processing. 

It  should  also  be  noted  that  the  annealing  algorithm  has  been  used  in  image  processing  ap¬ 
plications  to  minimize  cost  functions  not  derived  from  a  MRF  model  (c.f.  [15]  for  an  application 
to  edge  detection),  and  many  other  non-image  processing  applications  we  well.  There  has  been  a 
lot  of  research  on  the  convergence  of  discrete-state  Markov  chain  annealing  algorithms  and  diffu- 
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sion  annealing  algorithms,  but  very  few  results  are  known  about  continuous-state  Markov  chain 
annealing  algorithms. 

Our  research,  described  in  detail  in  [16]-[19],  addresses  the  following  questions: 

1.  What  is  the  relationship  between  the  Markov  chain  sampling  methods/annealing  algorithms 
and  the  diffusion  sampling  methods/annealing  algorithms? 

2.  What  types  of  convergence  results  can  be  shown  for  discrete-time  approximations  of  the 
diffusion  annealing  algorithms? 

3.  What  types  of  convergence  results  can  be  shown  for  continuous-state  Markov  chain  annealing 
algorithms? 

In  this  paper,  we  summarize  some  of  our  results.  In  Section  2  we  show  that  continuous  time 
interpolations  of  certain  Markov  chain  sampling  methods  and  annealing  algorithms  converge  weakly 
to  diffusions.  In  Section  3  we  establish  the  convergence  of  a  large  class  of  discrete  time  modified 
stochastic  gradient  algorithms  related  to  the  diffusion  annealing  algorithm.  Also  in  Section  3  we 
establish  the  convergence  of  certain  continuous-state  Markov  chain  annealing  algorithms,  essentially 
by  showing  that  they  can  be  expressed  in  the  form  of  modified  stochastic  gradient  algorithms.  This 
last  result  gives  a  unifying  view  of  the  Markov  chain  and  diffusion  versions  of  simulated  annealing 
algorithms.  In  Section  4  we  briefly  examine  some  directions  for  further  work. 


2  Convergence  of  Markov  Chain  Sampling  Methods  and  Anneal¬ 
ing  Algorithms  to  Diffusion 

In  this  section  we  analyze  the  dynamics  of  a  class  of  continuous  state  Markov  chains  which  arise 
from  a  particular  implementation  of  the  Metropolis  and  the  related  “Heat  Bath”  Markov  chain 
sampling  methods  [20].  Other  related  sampling  methods  (c.f.  [21])  can  be  analyzed  similarly.  We 
show  that  certain  continuous  time  interpolations  of  the  Metropolis  and  Heat  Bath  chains  converge 
weakly  (i.e.,  in  distribution  on  path  space)  to  Langevin  diffusions.  This  establishes  a  much  closer 
connection  between  the  Markov  chains  and  diffusions  than  just  the  fact  that  both  are  Markov 
processes  which  possess  an  invariant  Gibbs  distribution.  We  actually  show  that  the  interpolated 
Metropolis  and  Heat  Bath  chains  converge  to  the  same  Langevin  diffusion  running  at  different  time 
scales.  This  establishes  a  connection  between  the  two  Markov  chain  sampling  methods  which  is, 
in  general,  not  well  understood.  Our  results  apply  to  both  (fixed  temperature)  sampling  methods 
and  (decreasing  temperature)  annealing  algorithms. 

We  start  by  reviewing  the  discrete-state  Metropolis  and  Heat  Bath  Markov  chain  sampling 
methods.  Assume  that  the  state  space  S  is  countable.  Let  [/(•)  be  the  real-valued  energy  function 
on  S  for  the  system.  Also  let  T  be  the  (positive)  temperature  of  the  system.  Let  q{i,j)  be 
a  stationary  transition  probability  from  i  to  j  for  i,j  E  S.  The  general  form  of  the  transition 
probability  from  i  to  j  for  the  discrete-state  Markov  chains  (Afc)  we  consider  is  given  by 

PihJ)  =  =  i),  (2.1) 

where 

=  (2-2) 

j 
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s{i,j)  is  a  weighting  factor  (0  <  s{i,j)  <  1),  and  !(•)  is  an  indicator  function.  Let  [a]+  denote  the 
positive  part  of  a,  i.e.,  [a]+  =  max{a,0}.  The  weighting  factor  s{i,j)  is  given  by 


=  exp(-[C/(j)  -  U{i)]+/T) 


for  the  Metropolis  Markov  chain,  and  by 


sniij)  = 


l  +  exp{-iU{j)-U{i))/T) 


(2.3) 


(2.4) 


for  the  Heat  Bath  Markov  chain. 

Let 

7r(0  =  |exp(-C7(0/r),  Z  =  J2^xpi-Uii)/T} 

i 

(assume  Z  <  oo).  If  the  stochastic  matrix  Q  =  [g(i,  j)]  is  symmetric  and  irreducible  then  the 
detailed  balance  equation 

=  7r(i)p(i,  0,  i,j  e  S, 

is  satisfied,  and  it  follows  easily  that  7r(f),  i  G  S,  are  the  unique  stationary  probabilities  for  both 
the  Metropolis  and  Heat  Bath  Markov  chains.  Hence  these  chains  may  be  used  to  sample  from 
and  to  compute  mean  values  of  functionals  with  respect  to  a  Gibbs  distribution  with  energy  U{-) 
and  temperature  T  [22].  The  Metropolis  and  Heat  Bath  chains  can  be  interpreted  (and  simulated) 
in  the  following  manner.  Given  the  current  state  =  i,  generate  a  candidate  state  Xk  =  j  with 
probability  q(i,j).  Set  the  next  state  JVfc+i  =  j  if  s{i,j)  >  &k,  where  ©jt  is  an  independent  random 
variable  uniformly  distributed  on  the  interval  [0,1];  otherwise  set  Xk+i  =  i. 

We  can  generalize  the  discrete  state  Markov  chain  sampling  methods  described  above  to  a 
continuous  d-dimensional  Euclidean  state  space  as  follows.  Let  {/(•)  be  a  smooth  real-valued  energy 
function  on  S  =  and  let  T  be  the  (positive)  temperature.  Let  q{x,  y)  be  a  stationary  transition 
density  from  a:  to  y  for  x,  y  G  R'^.  The  general  form  of  the  transition  probability  density  for  the 
continuous-state  Markov  chain  {Xjt}  we  consider  is  given  by 


p(x,  y)  =  y(x,  y)s(x,  y)  -|-  m{x)6{y  -  x),  (2.5) 

where 

m(x)  =  1  -  y  y)s{x,  y)dy,  (2.6) 

s(i,j)  is  a  weighting  factor  (0  <  s(*,i)  <  1),  and  6(-)  is  a  Dirac-delta  function.  Here  s(-,-)  = 
«m(-,  •)  and  s(-,  •)  =  sh(-,  •)  (see  (2.3),  (2.4))  for  the  generalized  Metropolis  and  Heat  Bath  chains, 
respectively. 

The  continuous  state  Metropolis  and  Heat  Bath  Markov  chains  can  be  interpreted  (and  simu¬ 
lated)  analogously  to  the  discrete  state  versions.  In  particular  y(x,y)  is  a  conditional  probability 
density  for  generating  a  candidate  state  Xk  =  y  given  the  current  state  Xk  =  x.  For  our  analysis 
we  shall  consider  the  case  where  only  a  single  component  of  the  current  state  is  changed  to  generate 
the  candidate  state,  and  the  component  is  selected  at  random  with  all  components  equally  likely. 
Furthermore,  we  shall  require  that  the  candidate  value  of  the  selected  component  depend  only  on 
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the  current  value  of  the  selected  component.  Let  Xi  denote  the  component  of  the  vector  x  €  R-'^. 
Let  r{xi,yi)  be  a  transition  density  from  Xi  to  y;  for  Xi,yi  €  R.  Hence  we  set 

1 

y)  =  3  H  Vi)  n  ~  ^i)  (2-7) 

i=l  i#i 

Suppose  we  take 

rixi,  yi)  =  l{xi  =  -  l)S{yi  -  1)  +  l{xi  =  l)(5(y,-  +  1)  (2.8) 

In  this  case,  if  the  coordinate  of  the  current  state  Xk  is  selected  (at  random)  to  be  changed  in 
generating  the  candidate  state  Xj,,  then  Xk,i  is  ±1  when  Xk,\  is  q::!.  If,  in  addition, 

u{x)  =  -Y,Jii^iXj,  xeR^ 
i#* 

then  {Xfe}  corresponds  to  a  discrete-time  kinetic  Ising  model  with  interaction  energies  Jij  [20]. 
Suppose  instead  we  take 

rixi,  yi)  =  exp  [-(y,-  -  Xif /2a‘^  (2.9) 

In  this  case,  if  the  coordinate  of  the  current  state  Xk  is  selected  (at  random)  to  be  changed  in 
generating  the  candidate  state  Xk,  the  Xk,i  is  conditionally  Gaussian  with  mean  Xk,i  and  variance 
cr^.  In  the  sequel,  we  shall  show  that  a  family  of  interpolated  Markov  chains  of  this  type  converges 
(weakly)  to  a  Langevin  diffusion. 

For  each  e  >  0  let  rs{-,  •)  denote  the  transition  density  in  (2.9)  with  cr^  =  e,  and  let  •) 
denote  the  corresponding  transition  density  in  (2.5)-(2.7).  Let  {A'f }  denote  the  Markov  chain  with 
transition  density  Psi‘,-)  and  initial  condition  Xq  =  Xq.  Interpolate  {X^)  into  a  continuous-time 
process  {X^{t),t  <  0}  by  setting 

=  i<0 

where  [a]  is  the  largest  integer  less  than  or  equal  to  a.  Now  the  precise  definition  of  the  weak 
convergence  of  the  process  X®(-)  to  a  process  X(-)  (as  e  — 0)  is  given  in  [23].  The  significance  of 
the  weak  convergence  is  that  it  implies  not  only  the  convergence  of  the  multivariate  distribution, 
but  also  the  convergence  of  the  distributions  of  many  interesting  path  functionals  such  as  maxima, 
minima,  and  passage  times  (see  [23]  for  a  full  discussion).  To  establish  weak  convergence  here  we 
require  the  following  condition  on  (/(■): 

(A)  y(-)  is  continuously  differentiable,  and  V{7(-)  is  bounded  and  Lipshitz  continuous. 

Theorem  2.1:  Assume  (A).  Then  there  is  a  standard  d-dimensional  Wiener  process  W(-)  and 
a  process  X(-)  (with  A(0)  =  Xq  in  distribution,  nonanticipative  with  respect  to  W(-),  such  that 
X^(-)  X(‘)  weakly  as  £  — >  0,  and 

a)  for  the  Metropolis  method 

dX(t)  =  +  dW(t)  (2.10) 
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b)  for  the  Heat  Bath  method 


dX{t)  = 


^u(m) 

4T 


dt  +  dW(t) 


(2.11) 


Proof:  see  [16] 

Note  that  Theorem  2.1  justifies  our  claim  that  the  interpolated  Metropolis  and  Heat  Bath  chains 
converge  to  Langevin  diffusions  running  at  different  time  scales.  Indeed,  suppose  ¥(•)  is  a  solution 
of  the  Langevin  equation 

dY(t)  =  -V17(y(f))df  +  V^dW{t)  (2.12) 

with  y(0)  =  Xo  in  distribution.  Then  for  r(f)  =  t/2T,  y(r(-))  has  then  same  multivariate  distribu¬ 
tions  as  X  (•)  satisfying  (2.10),  while  for  r(f)  =  if  AT,  Y  (r(-))  has  the  same  multivariate  distributions 
as  X{-)  satisfying  (2.11).  Observe  that  the  limit  diffusion  (2.10)  for  the  Metropolis  chain  runs  at 
twice  the  rate  of  the  limit  diffusion  (2.11)  for  the  Heat  Bath  chain,  independent  of  the  temperature. 

To  obtain  Markov  chain  annealing  algorithms  we  simply  replace  the  fixed  temperature  T  in  the 
above  Markov  chain  sampling  methods  by  a  temperature  schedule  {Tk}  (where  typically  Tk  0). 
We  can  establish  a  weak  convergence  result  for  a  nonstationary  continuous  state  Markov  chain  of 
this  type  as  follows.  Suppose  T{-)  is  a  positive  continuous  function  on  [0,oo).  For  e  >  0  let 

n=T{ks),  k  =  o,i,... 

and  let  {X|}  be  as  above  but  with  temperature  schedules  {T^}.  It  can  be  shown  that  Theorem  2.1 
is  valid  with  T  replaced  by  T{t)  in  (2.10)  and  (2.11).  Hence  the  Markov  chain  annealing  algorithms 
converge  weakly  to  time-scaled  versions  of  the  Markov  diffusion  annealing  algorithm 

dYit)  =  -VU{Y{t))dt  +  y/2T{t)dWit)  (2.13) 

We  remark  that  there  has  been  a  lot  of  work  establishing  convergence  results  for  discrete  state 
Markov  chain  annealing  algorithms  [6],  [24]-[27],  and  also  for  the  Markov  diffusion  annealing  algo¬ 
rithm  [7],  [28],  [29].  However,  there  are  very  few  convergence  results  for  continuous  state  Markov 
chain  algorithms.  We  note  that  the  weak  convergence  of  a  continuous  state  chain  to  a  diffusion 
together  with  the  convergence  of  the  diffusion  to  the  global  minima  of  {/(•)  does  not  directly  im¬ 
ply  the  convergence  of  the  chain  to  the  global  minima  of  !/(•);  see  [30]  for  a  discussion  of  related 
issues.  However,  establishing  weak  convergence  is  an  important  first  step  in  this  regard.  Indeed,  a 
standard  method  for  establishing  the  asymptotic  (large-time)  behavior  of  a  large  class  of  discrete¬ 
time  recursive  stochastic  algorithms  involves  first  proving  weak  convergence  to  an  ODE  limit.  The 
standard  method  does  not  quite  apply  here  because  we  have  a  discrete-time  algorithm  converging 
weakly  to  a  nonstationary  SDE  limit.  But  calculations  similar  to  those  used  to  establish  the  weak 
convergence  do  in  fact  prove  useful  in  ultimately  establishing  the  convergence  of  continuous  state 
Markov  chain  annealing  algorithms,  which  is  discussed  in  Section  3.2. 
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3  Recursive  Stochastic  Algorithms  for  Global  Optimization  in 

3.1  Modified  Stochastic  Gradient  Algorithms 

In  this  section,  we  consider  a  class  of  algorithms  for  finding  a  global  minimum  of  a  smooth  function 
U{x),  X  G  Specifically,  we  analyze  the  convergence  of  a  modied  stochastic  gradient  algorithm 

=Xk-  ak{VU{Xk)  +  ik)  +  hkWk,  (3.1) 

where  is  a  sequence  of  R'^-valued  random  variables,  is  a  sequence  of  standard  d- 

dimensional  independent  Gaussian  random  variables,  and  {a^},  {6^}  are  sequences  of  positive 
numbers  with  a^,  bk  — >  0.  An  algorithm  of  this  type  arises  by  artifically  adding  the  bkWk  term  (via 
aMonte  Carlo  simulation)  to  a  standard  stochastic  gradient  algorithm 

Zk+i  =  Zk  -  ak{VU{Zk) +^k)-  (3.2) 

Algorithms  like  (3.2)  arise  in  a  variety  of  optimization  problems  including  adaptive  filtering,  iden¬ 
tification  and  control;  here  the  sequence  {^k}  is  due  to  noisy  or  imprecise  measurements  of  Vi7(  ) 
(c.f.  [31]).  The  asymptotic  behavior  of  {Zk}  has  been  much  studied.  Let  S  and  S*  be  the  set  of 
local  and  global  minimaa  of  !/(•),  respectively.  It  can  be  shown,  for  example,  that  if  {/(•)  and 
are  suitably  behaved,  ak  =  Afk  for  k  large,  and  {Zk}  is  bounded,  then  Zk  ^  S  as  k  oo  w.p.l. 
However,  in  general  Zk  A  (unless  of  course  S  =  S*).  The  idea  behind  adding  the  additional 
bkWk  term  in  (3.1)  compared  with  (3.2)  is  that  if  bk  tends  to  zero  slowly  enough,  then  possibly 
{Afc}  (unlike  {Zk})  will  avoid  getting  trapped  in  a  strictly  local  minimum  of  {/(•)  (this  is  the  usual 
reasoning  behind  simulated  annealing  type  algorithms).  We  shall  infact  show  that  if  i7(-)  and 
are  suitably  behaved,  ak  =  A/ k  and  b\  =  B/kloglogk  for  k  large  with  B/A  >  Co  (where  Co  is  a 
positive  constant  which  depends  only  on  [/(•)),  and  {Afc}  is  tight,  then  Xk  — >  5*  as  ife  oo  in 
probability.  We  also  give  a  condition  for  the  tightness  of  We  note  that  the  convergence  of 

Zk  to  S  can  be  established  under  very  weak  conditions  on  {^fc}  assuming  {Zk}  is  bounded.  Here 
the  convergence  of  Xk  to  S*  is  established  under  somewhat  stronger  conditions  on  {^jt}  assuming 
that  {Afc}  is  tight  (which  is  weaker  than  boundedness). 

The  analysis  of  the  convergence  of  {Xk}  is  usually  based  on  the  asymptotic  behavior  of  the 
associated  ordinary  differential  equation  (ODE) 

z{t)  =  -VU{z{t))  (3.3) 

(c.f.  [31], [32]).  This  motivates  our  analysis  of  the  convergence  of  {Afc}  based  on  the  asymptotic 
behavior  of  the  associated  stochastic  differential  equation  (SDE) 

dY{t)  =  -VU{Y{t))dt  -h  c{t)dWit),  (3.4) 

where  W(-)  is  a  standard  d-dimensional  Wiener  process  and  c(-)  is  a  positive  function  with  c{i)  — +  0 
as  t  — >  oo.  This  is  just  the  diffusion  annealing  algorithm  discussed  in  Section  2  (see  (2.13))  with 
T{t)  =  c^(t)/2.  The  asymptotic  behavior  of  y(t)  as  t  —>  oo  has  been  studied  intensively  by  a 
number  of  researchers.  In  [7],  [29]  convergence  results  where  obtained  by  considering  a  version 
of  (3.4)  with  a  reflecting  boundary;  in  [28]  the  reflecting  boundary  was  removed.  Our  analysis  of 
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{JVfe}  is  based  on  the  analysis  of  Y(t)  developed  in  [28]  where  the  following  result  is  proved:  if 
U{-)  is  well-behaved  and  c^(f)  =  C/logt  for  t  large  with  C  >  Co  (the  same  constant  Co  as  above) 
then  Yit)  —*  S*  as  i  —*  oo.  To  see  intuitively  how  and  y(-)  are  related,  let  if.  =  YlnZo^^n, 

ofc  =  A/k,  h\  =  B/kloglogk,  c^(t)  =  Cflogt,  and  B/A  =  C.  Note  that  hk  ~  c{tk)y/ak.  Then  we 
should  have  that 

~  y(4)- (4+1 -4)V(7(y(4))-bc(4)(i^(4+i)- 1^(4)) 

=  y(4)  -  afeVc/(y(4))  +  c(4)\/^ 

Y(tk)  -  akVU{Yitk))  +  bkVk 

where  {14}  is  a  sequence  of  standard  d-dimensional  independent  Gaussian  random  variables.  Hence 
(for  {^fc}  small  enough)  {Xk}  and  {y(4)}  should  have  approximately  the  same  distributions.  Of 
course,  this  is  a  heuristic;  there  are  significant  technical  difficulties  in  using  y(-)  to  analyze  {Xj;.} 
because  we  must  deal  with  long  time  intervals  and  slowly  decreasing  (unbounded)  Gaussian  random 
variables. 

An  algorithm  like  (3.1)  was  first  proposed  and  analyzed  in  [29].  However,  the  analysis  required 
that  the  trajectories  of  {Afc}  lie  within  a  fixed  ball  (which  was  achieved  by  modifying  (3.1)  near 
the  boundary  of  the  ball).  Hence  such  a  version  of  (3.1)  is  only  suitable  for  optimizing  [/(•)  over  a 
compact  set.  Furthermore  the  analysis  also  required  ^k  to  be  zero  in  order  to  obtain  convergence.  In 
our  first  analysis  of  (3.1)  in  [17]  we  also  required  that  the  trajectories  of  {Ajs,}  lie  in  a  compact  set. 
However,  our  analysis  did  not  require  ^k  to  be  zero,  which  has  important  implications  when  V{7(-) 
is  not  measured  exactly.  In  our  later  analysis  of  (3.1)  in  [18]  we  removed  the  requirement  that 
the  trajectories  of  {A*}  lie  in  a  compact  set.  From  our  point  of  view  this  is  the  most  significant 
difference  between  our  work  in  [18]  and  what  is  done  in  [29],  [17]  (and  more  generally  in  other 
work  on  global  optimization  such  as  [33]):  we  deal  with  unbounded  processes  and  establish  the 
convergence  of  an  algorithm  which  finds  a  global  minimum  of  a  function  when  it  is  not  specified  a 
priori  what  bounded  region  contains  such  a  point. 

We  now  state  the  simplest  result  from  [18]  concerning  the  convergence  of  the  modified  stochastic 
gradient  algorithm  (3.1).  We  will  require 

and  the  following  conditions: 

(Al)  U(-)  is  a  function  from  to  [0,oo)  such  that  the  5*  =  {a:  :  U{x)  <  {7(y)Vj/}  ^  0.  (We 
also  require  some  mild  regularity  conditions  on  {/(•);  see  [18]). 

(A2)  >  0,  lim^.^oo-*^]^  <  oo. 

(A3)  lima;-K3o  m)  ~  ^ 

(A4)  For  k  =  0, 1, ... ,  let  Tk  be  the  <r-field  generated  by 

Xq,  Wq,  . . . ,  There  exists  an  X  >  0,  at  >  —1,  and  ^  >  0  such  that 

<  LatilXkf  +  1),  \E{ik\:Fk}\  <  iaf (l^fel  +  1)  w-pA 

and  Wk  is  independent  of 
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Theorem  3.1:  Assume  (A1)-(A4)  hold.  Let  {Xk}  be  given  by  (3.1).  Then  there  exists  a  constant 
Co  such  that  for  B/A  >  Co 

Xk  S*  as  k  oo 

in  probability. 

Proof:  See  [18] 


Remarks: 

1.  The  constant  Co  plays  a  critical  role  in  the  convergence  of  Xk  as  k  oo  and  also  Y(t)  as 
t  — >  oo.  In  [28]  it  is  shown  that  the  constant  Co  (denoted  there  by  co)  has  an  interpretation 
in  terms  of  the  action  functional  for  a  family  of  perturbed  dynamical  systems;  see  [28]  for  a 
further  discussion  of  Co  including  some  examples. 

2.  It  is  possible  to  modify  (3.1)  in  such  a  way  that  only  the  lower  bound  and  not  the  upper 
bound  on  |Vf7(-)|  in  (-^^2)  is  needed  (see  [18]). 

3.  In  [18]  we  actually  separate  the  problem  of  convergence  of  into  two  parts:  one  to 

establish  tightness  and  another  to  establish  convergence  given  tightness.  This  is  analogous  to 
separating  the  problem  of  convergence  of  into  two  parts:  one  to  establish  boundedness 
and  another  to  establish  convergence  given  boundedness  (c.f.  [31]).  Now  in  [18]  the  conditions 
given  for  tightness  are  much  stronger  than  the  conditions  given  for  convergence  assuming 
tightness.  For  a  particular  algorithm  it  is  often  possible  to  prove  tightness  directly,  resulting 
in  somewhat  weaker  conditions  than  those  given  in  Theorem  3.1. 

3.2  Continuous-State  Markov  Chain  Algorithm 

In  this  section  we  examine  the  convergence  of  a  class  of  continuous-state  Markov  chain  annealing 
algorithms  similar  to  those  described  in  Section  2.  Our  approach  is  to  write  such  an  algorithm 
in  the  form  of  a  modified  stochastic  gradient  algorithm  of  (essentially)  the  type  considered  in 
Section  3.1.  A  convergence  result  is  obtained  for  global  optimization  over  all  of  Some  care 
is  necessary  to  formulate  a  Markov  chain  with  appropriate  scaling.  It  turns  out  that  writing  the 
Markov  chain  annealing  algorithm  in  (essentially)  the  form  (3.1)  is  rather  more  complicated  than 
writing  standard  variations  of  gradient  algorithms  which  use  some  type  of  (possibly  noisy)  finite 
difference  estimate  of  VU{-)  in  the  form  (3.2)  (c.f.  [31]).  Indeed,  to  the  extent  that  the  Markov 
chain  annealing  algorithm  uses  an  estimate  of  V(7(-),  it  does  so  in  a  much  more  subtle  manner  than 
a  finite  difference  approximation. 

Although  some  numerical  work  has  been  performed  with  continuous-state  Markov  chain  an¬ 
nealing  algorithms  [13],  [14],  there  has  been  very  little  theoretical  analysis,  and  furthermore  the 
analysis  of  the  continuous  state  case  does  not  follow  from  the  finite  state  case  in  a  straightforward 
way  (especially  for  an  unbounded  state  space).  The  only  analysis  we  are  aware  of  is  in  [13]  where  a 
certain  asymptotic  stability  property  is  established.  Since  our  convergence  results  for  the  continu¬ 
ous  state  Markov  chain  annealing  algorithm  are  ultimately  based  on  the  asymptotic  behavior  of  the 
diffusion  annealing  algorithm,  our  work  demonstrates  and  exploits  the  close  relationship  between 
the  Markov  chain  and  diffusion  versions  of  simulated  annealing. 
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We  shall  perform  our  analysis  of  continuous  state  Markov  chain  annealing  algorithms  for  a 
Metropolis  type  chain.  We  remark  that  convergence  results  for  other  continuous-state  Markov  chain 
sampling  method-based  annealing  algorithms  (such  as  the  Heat  Bath  method)  can  be  obtained  by 
a  similar  procedure.  Recall  that  the  1-step  transition  probability  density  for  a  continuous  state 
Metropolis-type  (fixed  temperature)  Markov  chain  is  given  by  (see  equations  (2.3),  (2.5),  (2.6)) 

p{x,  y)  =  q{x,  y)s{x,  y)  -f  m{x)d{y  -  x) 

where 

m(a:)  =  ^-  J  y)s{x,  y)dy 

and 

s{x,y)  =  exp{-[U{y)  -  U{x)]+/T). 

Here  we  have  dropped  the  subscript  on  the  weighting  factor  s{x,y).  If  we  replace  the  fixed  tem¬ 
perature  T  by  a  temperature  sequence  {!*}  we  get  a  Metropolis-type  annealing  algorithm. 

Our  goal  is  to  express  the  Metropolis-type  annealing  algorithm  as  a  modified  stochastic  gradient 
algorithm  like  (3.1)  so  as  to  establish  its  convergence.  This  leads  us  to  choosing  a  nonstationary 
Gaussian  transition  density 


qkix,y) 


1  /  ly  -  \ 


and  a  state-dependent  temperature  sequence 


2ak 


where  cr(-)  is  a  continuous  function  from  to  [1,  oo)  such  that 


(3.6) 


(3.7) 


cr{x)  ~  |a;|  as  a:  oo 

(e.g.  o-(a;)  =  1  +  |a:|  or  cr{x)  =  max{l,|ar|}  will  do).  With  these  choices  the  Metropolis-type 
annealing  algorithm  can  be  expressed  as 


^*+1  =  Xk-  ak{yU{Xk)  +  4fc)  +  bk<^{Xk)Wk  (3.8) 

for  appropriately  behaved  Note  that  (3.8)  is  not  identical  to  (3.1)  (because  cr{x)  ^  1),  but  it 
turns  out  that  Theorem  3.1  holds  for  {A'a,}  generated  by  either  (3.1)  or  (3.8).  We  remark  that  the 
state  dependent  term  cr(a:)  term  in  (3.6)  and  (3.7)  produces  a  drift  toward  the  origin  proportional 
to  |ar|,  which  is  needed  to  establish  tightness  of  the  annealing  chain. 

This  discussion  leads  us  to  the  following  continuous-state  Metropolis-type  annealing  algorithm. 
Let  JV(m,  A)  denote  d-dimensional  normal  measure  with  mean  m  and  covariance  matrix  A. 
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3.3  Continuous-State  Metropolis- Type  Annealing  Algorithm: 

Let  {Xjsr}  be  a  Markov  chain  with  1  step  transition  probability  at  time  k  given  by 


where 


and 


€  A\Xk  =  x}  =  f  Sk{x,  y)dN{x,  blt7^{x)I){y)  -|-  mfc(a;)U(ar)  (3-9) 

JA 

mk{x)  =  1  -  y  y)dN{x,  6|cr2(*)/)(y)  (3.10) 


Sfc(a^,y)  =  exp 


2ak[U{y)  -  U{x)]+ 
H  0-2  (x) 


(3.11) 


We  now  state  a  convergence  result  from  [19]  concerning  the  convergence  of  the  continuous-state 
Metropolis  type  annealing  algorithm.  Let  the  sequences  {a;;,}  and  {6^}  be  given  by  (3.5). 


Theorem  3.2:  Assume  (A1)-(A3)  hold.  Let  {X*}  be  the  Markov  chain  with  transition  probability 
given  by  (3.9)-(3.11).  Then  there  exists  a  constant  Co  such  that  for  B/A  >  Co 

Xk  —>■  S  *  as  Jk  — +  oo 


in  probability. 

Proof:  We  mentioned  above  that  Theorem  3.1  holds  for  either  (3.1)  or  (3.8).  Hence  it  is  enough  to 
show  that  defined  by  (3.8)  satisfies  (A4).  In  [19]  it  is  shown  that  there  exists  an  L  >  0  such 
that 

Emf\y^k}<L^{\Xk\^  +  l),  \E{U\Ek}\<L^i\Xk\  +  l)  W.p.l 

ak  Ok 

It  follows  that  (A4)  is  satisfied  with  a  =  —1/2  and  j3  £  (0, 1/2).  □ 

Remarks: 

1.  The  constant  Co  is  the  same  constant  described  in  Remark  1  following  Theorem  3.1 

2.  It  is  possible  to  modify  (3.9)-(3.11)  in  such  a  way  that  only  the  lower  bound  and  not  the 
upper  bound  on  |V(7(')|  from  (A2)  is  needed  (see  [19]). 


4  Conclusions 

Monte  Carlo  sampling  methods  and  annealing  algorithms  have  found  significant  application  to 
MRF-based  image  processing.  These  algorithms  fall  broadly  into  two  groups:  Markov  chain  and 
diffusion  methods.  The  discrete-state  Markov  chain  algorithms  have  been  used  with  finite  range 
MRF  models,  while  both  continuous-state  Markov  chain  and  diffusion  algorithms  have  been  used 
with  continuous  range  MRF  models.  We  note  that  there  are  some  very  interesting  questions  related 
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to  the  parallel  implementation  of  these  Monte  Carlo  procedures  which  we  have  not  discussed  here; 
see  [34]. 

In  this  paper  we  summarized  some  of  our  research  which  has  investigated  the  relationship 
between  the  various  Markov  chain  and  diffusion  sampling  methods  and  annealing  algorithms.  We 
demonstrated  the  weak  convergence  of  certain  interpolated  Markov  chain  sampling  methods  and 
annealing  algorithms  to  diffusions.  We  also  established  the  large-time  convergence  of  a  class  of 
discrete-time  modified  stochastic  gradient  algorithms  based  on  the  asymptotic  behavior  of  the 
associated  diffusion  annealing  algorithm.  We  further  established  the  large-time  convergence  of  a 
continuous-state  Markov  chain  annealing  algorithm  by  writing  it  in  the  form  of  such  a  modified 
stochastic  gradient  algorithm.  The  convergence  here  is  to  the  global  minima  of  an  energy  cost 
function  defined  on  the  entire  d-dimensional  Euclidean  space. 

It  seems  to  us  that  some  experimental  comparisons  of  continuous  state  Markov  chain  and 
diffusion-type  annealing  algorithms  (practically  implemented  by  the  modified  stochastic  gradient 
algorithms  described  above)  on  image  segmentation  and  restoration  problems  would  be  of  some 
interest.  We  are  not  aware  of  any  explicit  comparisons  of  this  type  in  the  literature.  It  might  also 
be  useful  to  examine  the  application  of  the  modified  stochastic  gradient  algorithms  to  adaptive 
pattern  recognition,  filtering  and  identification,  where  stochastic  gradient  algorithms  are  frequently 
employed.  Because  of  the  slow  convergence  of  the  modified  stochastic  gradient  algorithms,  offline 
applications  will  probably  be  required.  One  particular  application  which  might  prove  fruitful  is 
training  multilayer  feedforward  “neural  nets”,  which  is  a  nonconvex  optimization  problem  often 
plagued  with  local  minima  [35]. 
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